
Exploding Gradients

Especially in deep networks and recurrent networks, gradients can become extremely large. This can either update the model in an abrupt way or, in the case of Adam-like optimizers, corrupt the second-moment estimate and slow down learning.
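A minimal sketch of the second point, using made-up gradient values that are not from this note: a single gradient spike inflates Adam's second-moment EMA, so the effective step for normal gradients afterwards shrinks by orders of magnitude.

```python
# Toy simulation of Adam's second-moment update around one gradient spike.
beta2, eps = 0.999, 1e-8
v = 0.0                                        # second-moment EMA
grads = [0.01] * 5 + [1000.0] + [0.01] * 5     # one exploding gradient in the middle

for t, g in enumerate(grads, start=1):
    v = beta2 * v + (1 - beta2) * g ** 2       # second-moment update
    v_hat = v / (1 - beta2 ** t)               # bias correction
    step = g / (v_hat ** 0.5 + eps)            # unscaled Adam step (learning rate omitted)
    print(f"t={t:2d}  g={g:8.2f}  step={step:.6f}")

# Before the spike the normalized step is ~1; after it, v_hat stays inflated,
# so the steps for ordinary gradients drop to ~1e-5 until the EMA decays.
```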

In transformers, attention spikes can also cause this.

It's mitigated by:

  • Gradient clipping (see the sketch below)
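
A minimal sketch of gradient clipping, assuming a PyTorch training loop; `model`, `loss_fn`, `optimizer`, and `max_grad_norm` are placeholders, not from this note.

```python
import torch

max_grad_norm = 1.0  # assumed threshold; values around 0.5-5.0 are common

def train_step(model, batch, loss_fn, optimizer):
    optimizer.zero_grad()
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Rescale all gradients so their combined L2 norm is at most max_grad_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```

Clipping by global norm rescales every gradient by the same factor, so the update direction is preserved while its magnitude is bounded.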